EXERCISE: EXECUTION GRAPHS


Construct  an  execution  graph  for  the  signal  processing  scenario 
described  below. 


The  signal  processing  scenario  receives  a  frame  consisting  of  512*512 
pixels.  It  executes  setup  processing  to  register  and  calibrate  the  frame. 
Next  the  scenario  executes  filter  processing  to  remove  clutter  and  identify 
(up  to  1000)  tracks.  Finally,  the  scenario  executes  track  processing  to 
identify  (up  to  50)  objects  of  interest  from  among  the  tracks. 


Setup  processing: 

Setup processing executes (expanded) component Frame_reg then
executes (expanded) component Radiometric_corr. Frame_reg performs a
2D interpolation which consists of a loop executed once per pixel that
executes several arithmetic computations*. Radiometric_corr performs a 1D
interpolation which consists of a loop executed once per pixel that executes
several arithmetic computations.


Filter  processing: 

Filter processing executes (expanded) component Spatial_filter then
executes (expanded) component Temporal_filter. Spatial_filter performs a
2D convolution which consists of a loop executed once per pixel that
executes an inner loop (9 times) consisting of several arithmetic
computations to "smooth" pixel intensity based on the intensity of the 9
adjacent pixels. Temporal_filter performs a third-order differencing which
consists of a loop executed once per pixel that performs two steps: first it
executes several arithmetic computations to "weight" pixels based on their
intensity in previous frames, then it executes arithmetic computations to
select pixels whose "weight" exceeds a threshold.


Track  processing: 

Track  processing  executes  (expanded)  component  fading_track  then 
executes  (expanded)  component  hough_transform,  then  executes 
(expanded)  component  candidate_selection. 

Fading_track  performs  a  third-order  differencing  which  consists  of  a 
loop  executed  once  per  track  consisting  of  several  arithmetic  computations 
to  eliminate  fading  tracks. 

Hough_transform calculates track parameters by executing a loop (once
per track) with an inner loop executed once per object. The inner loop
performs two steps: first it executes several arithmetic computations to
calculate "rho", then it executes instructions to update a rho table.

Component Candidate_selection executes a loop (once for each of the
100 bytes of information stored for each of the objects) consisting of several
computations to compute a threshold value.


*  The  arithmetic  and  other  computations  will  be  specified  later. 
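
One possible starting point for the exercise is to write the graph down as nested nodes with repetition factors and count how often each basic computation runs per frame. The sketch below is only an illustration: the `Node` class, the `leaf_executions` helper, and the leaf names are hypothetical, the loop bounds use the stated upper limits (1000 tracks, 50 objects), and the "several arithmetic computations" are left as unspecified leaves because they are to be specified later.

```python
from dataclasses import dataclass, field

PIXELS = 512 * 512        # pixels per frame
TRACKS = 1000             # upper bound on tracks
OBJECTS = 50              # upper bound on objects of interest
BYTES_PER_OBJECT = 100    # bytes of information stored per object


@dataclass
class Node:
    """One execution-graph node; `repeat` models the loop around its body."""
    name: str
    repeat: int = 1
    children: list = field(default_factory=list)


def leaf_executions(node, multiplier=1):
    """Return {leaf name: executions per frame} for the subtree under node."""
    total = multiplier * node.repeat
    if not node.children:
        return {node.name: total}
    counts = {}
    for child in node.children:
        counts.update(leaf_executions(child, total))
    return counts


scenario = Node("signal_processing", children=[
    Node("setup", children=[
        Node("Frame_reg", PIXELS, [Node("Frame_reg.arith")]),
        Node("Radiometric_corr", PIXELS, [Node("Radiometric_corr.arith")]),
    ]),
    Node("filter", children=[
        Node("Spatial_filter", PIXELS, [Node("Spatial_filter.smooth", 9)]),
        Node("Temporal_filter", PIXELS, [
            Node("Temporal_filter.weight"),
            Node("Temporal_filter.select"),
        ]),
    ]),
    Node("track", children=[
        Node("Fading_track", TRACKS, [Node("Fading_track.arith")]),
        Node("Hough_transform", TRACKS, [
            Node("Hough.per_object", OBJECTS, [
                Node("Hough.rho"),
                Node("Hough.update_table"),
            ]),
        ]),
        Node("Candidate_selection", OBJECTS * BYTES_PER_OBJECT,
             [Node("Candidate_selection.threshold")]),
    ]),
])

counts = leaf_executions(scenario)
for name, n in counts.items():
    print(f"{name:32s} {n:>9d}")
```

Once the arithmetic computations are specified, each leaf count can be multiplied by that leaf's resource requirement to get the per-frame demand.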
F/A-18 Algorithm Analysis


The  following  case  study  information  was  prepared  by  a  member  of 
the  F/A-18  design  team.  It  demonstrates  the  type  of  information 
that  can  be  obtained  for  SPE  studies,  and  the  approach  he  took  in 
gathering  the  data.  We  are  grateful  for  the  effort  he  put  forth 
to  support  this  project. 


" . . . .  Estimating  the  word  count  for  each  algorithm  could  be  very 
difficult  so  I  decided  to  simplify  it  a  bit.  I  planned  to  count 
"high-level"  instructions  and  then  use  an  expansion  factor 
(high-level  to  assembly)  to  determine  the  total  number  of 
assembly  language  instructions. 


Determining  a  valid  expansion  factor  is  the  hard  part:  some 
high-level  instructions  can  be  represented  as  single  assembly 
instructions,  but  most  require  two  or  more  instructions. 
Eventually  I  came  up  with  the  following  method: 


Assignment      Load (e.g. LD), Store (e.g. SD)
Add/Subtract    Load, Add (e.g. AD), Rescale (e.g. LALD), Store     (1)
Multiply        Load, Multiply (e.g. MDR), Rescale, Store
Divide          Load, Rescale, Divide (e.g. D), Store


Note that the example instructions are for double precision
(32-bit) integer arithmetic. The divide is not double precision:
Single precision divide (it's 4x as fast). The AYK-14 XN-5 has
no floating point unit (i.e. fixed point arithmetic is used
throughout).
To "validate" the expansion factor I used data from a previous
project. This implemented an algorithm which was first modelled
in Fortran. The Fortran statements were classified and counted:


                High-Level    Est Asm
Assignment          59          118
Add/Subtract        57          228
Multiply            28          112
Divide               9           36

Total              153          494  ==>  3.22 Est Expansion


The project actually used 422 assembly language instructions
(sorry - no breakdown into categories) which results in a
2.76 expansion factor. Because the project concentrated on
optimizing memory usage, I think that 2.76 is a little low
for the average project. I believe the original assumption (1)
will work.
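
The validation arithmetic can be reproduced in a few lines. This is a minimal sketch, assuming method (1) expands an assignment to 2 assembly instructions and every other statement class to 4; the dictionary and function names are illustrative only.

```python
# Assembly instructions per high-level statement under method (1):
# assignment = load + store; the others = load, operate, rescale, store.
ASM_PER_STATEMENT = {"assign": 2, "add_sub": 4, "multiply": 4, "divide": 4}

# Statement counts from the Fortran model of the previous project.
fortran_counts = {"assign": 59, "add_sub": 57, "multiply": 28, "divide": 9}


def estimate_asm(counts):
    """Estimated assembly instruction count for a statement mix."""
    return sum(ASM_PER_STATEMENT[kind] * n for kind, n in counts.items())


high_level = sum(fortran_counts.values())      # 153 statements
estimated_asm = estimate_asm(fortran_counts)   # 494 instructions
estimated_factor = estimated_asm / high_level
actual_factor = 422 / high_level               # 422 instructions were actually used

print(f"estimated {estimated_factor:.2f}, actual {actual_factor:.2f}")
```

Rounded to two places the estimated factor comes out as 3.23; the 3.22 quoted in the table appears to be the same ratio truncated.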


The  expansion  factor  mentioned  from  high-level  statements  to 
assembly  level  statements  does  not  account  for  instructions 
which  require  two  machine  words.  These  words  don't  degrade 
execution  speed  (any  more  than  the  rates  listed  below)  but  they 
do  take  up  more  memory.  In  general,  there  is  a  20%-50%  increase 
from  instruction  count  to  memory  requirements  (e.g.  10 
instructions  may  take  12  to  15  machine  words ) . 
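
As a small illustration of the 20%-50% rule, the helper below (a hypothetical name, not part of the original study) converts an instruction count into a range of machine words.

```python
def memory_words_range(instructions, low=0.20, high=0.50):
    """Return (min, max) machine words for a given instruction count."""
    return (round(instructions * (1 + low)),
            round(instructions * (1 + high)))


# The text's own example: 10 instructions take 12 to 15 machine words.
print(memory_words_range(10))
```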

Estimated  execution  performance  for  the  assembly  language 
statements  is  listed  below  (in  microseconds): 


INSTR   DESCRIPTION        XN-5    XN-6

LD      Load Double        2.49    0.95
SD      Store Double       2.68    2.10
AD      Add Double         2.73    1.19
LALD    Left Shift Dbl     1.89    1.11
MD      Multiply Double    8.27    4.07
D       Divide             9.87    4.38  (note single precision)
L       Load Single        2.24    0.80
S       Store Single       1.86    1.15
A       Add Single         2.24    0.94
LALS    Left Shift Sngl    1.47    0.90
M       Multiply Single    5.40    2.19
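
Combining these instruction times with the instruction sequences of method (1) gives a time per high-level statement. The sketch below uses the double precision entries and rests on assumptions that are not stated in the source: "Rescale" is costed as LALD, the double multiply is costed as the MD table entry, and the divide pairs double loads and stores with the single precision D. All names are illustrative.

```python
# Instruction times in microseconds, copied from the table above.
TIMES_US = {
    "XN-5": {"LD": 2.49, "SD": 2.68, "AD": 2.73,
             "LALD": 1.89, "MD": 8.27, "D": 9.87},
    "XN-6": {"LD": 0.95, "SD": 2.10, "AD": 1.19,
             "LALD": 1.11, "MD": 4.07, "D": 4.38},
}

# Instruction sequence per high-level statement, following method (1).
# Assumption: "Rescale" is a left shift (LALD).
SEQUENCES = {
    "assign":   ["LD", "SD"],
    "add_sub":  ["LD", "AD", "LALD", "SD"],
    "multiply": ["LD", "MD", "LALD", "SD"],
    "divide":   ["LD", "LALD", "D", "SD"],
}


def statement_time_us(kind, cpu):
    """Estimated time in microseconds for one high-level statement."""
    return sum(TIMES_US[cpu][op] for op in SEQUENCES[kind])


for cpu in TIMES_US:
    row = ", ".join(f"{kind}={statement_time_us(kind, cpu):.2f}"
                    for kind in SEQUENCES)
    print(f"{cpu}: {row}")
```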

The  Algorithms  - 

Both  algorithms  will  require  an  interface  to  the  current 
program.  This  interface  (setting  data  up  etc.)  has  been 
estimated  to  require  1000  assembly  words.  No  mix  of  statements 
has  been  given  so  I  had  to  guess.  My  guess  is  the  result 
of  looking  at  an  algorithm  intended  to  function  similarly 
to  the  two  candidate  algorithms.  I  counted  the  mix  of 
high-level  statements;  this  should  be  used  to  determine 
the  overall  makeup  of  this  interface.  I  would  assume  that 
all  of  these  instructions  run  during  each  pass. 


                HOL    %total

Assignment       37      37%
Add/Subtract     20      20%      (2)
Multiply         31      31%
Divide           12      12%

Total           100

This  interface  will  run  at  a  20Hz  rate.  It  sets  data  up  to  be 
used  by  the  algorithms  by  "filtering"  20Hz  data  to  be  used 
by  the  algorithms  at  1Hz.  Data  is  passed  from  the  20Hz  task  to 
the  1Hz  task  with  the  use  of  common  memory. 
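
One way to turn the 1000-word estimate into a statement mix is to weight each class in mix (2) by its expansion under method (1). This is a sketch under two assumptions not confirmed by the source: the "1000 assembly words" are treated as 1000 assembly instructions, and the percentages apply to high-level statements. The function name is hypothetical.

```python
ASM_PER_STATEMENT = {"assign": 2, "add_sub": 4, "multiply": 4, "divide": 4}
MIX_PERCENT = {"assign": 37, "add_sub": 20, "multiply": 31, "divide": 12}


def split_assembly(total_asm):
    """Split an assembly instruction budget across statement classes."""
    weights = {kind: MIX_PERCENT[kind] * ASM_PER_STATEMENT[kind]
               for kind in MIX_PERCENT}
    scale = total_asm / sum(weights.values())
    return {kind: w * scale for kind, w in weights.items()}


interface = split_assembly(1000)
for kind, asm in interface.items():
    statements = asm / ASM_PER_STATEMENT[kind]
    print(f"{kind:9s} {asm:7.1f} asm  ~{statements:6.1f} statements")
```

Because the interface runs at 20Hz, whatever per-pass time results from this split is incurred 20 times per second.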

Both algorithms are called only in certain conditions: the
aircraft "master mode" must be A/A; the radar must be in
"angle-only" track; and the radar must not have range
information available. Essentially, the pilot only has
angular data (position, rates, acceleration) to work with.

The calling conditions for the algorithm take it out of
the "worst-case" execution path (which would be "Track While
Scan" (TWS) mode with 10 targets). In the future, TWS mode
will have angle-only track data for up to ten targets; if
we tried to use the algorithm on all the targets, this would
be the worst case path. I think we should use this as a
separate case within the study. This case would require that
the "interface" and the 1Hz task execute for each target.

Initially I thought that TWS would have a looping structure
to handle up to ten targets. This looping structure actually
is in the Tactical Controls and Displays (TCD) module. The
looping structure is required for the display of the target
data. The A/A module apparently doesn't compute the missile
launch parameters for each target - only for the Lock and
Shoot (L&S) target. This makes modelling the A/A module
straightforward: it is essentially a straight through path
(see the flow chart). Following are some details (I have
omitted most details):


Block   Assign   Add/Sub   Mult   Div

B.1       860        0       0      0    (mostly setup logic)
B.2       317      172     266    103    (matrix transforms)
B.3         0        0       0      0    (new 20Hz goes here)
B.4       340      184     285    110    (missile calculations)
B.5       363      196     304    118    (launch zone calcs)
B.6       272      147     228     88    (missile calcs)
B.7       640        0       0      0    (mostly logic)

NOTE: The above instructions are single precision, not double
precision.
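
Since the A/A module is a straight through path, a rough path time is the sum over blocks of statement count times statement time. The sketch below is illustrative only and assumes that the block counts are high-level statements, that each statement expands as in method (1), that "Rescale" is costed as LALS, and that the XN-5 single precision times from the instruction table apply. Names are hypothetical.

```python
# (Assign, Add/Sub, Mult, Div) counts per block of the A/A module.
BLOCKS = {
    "B.1": (860, 0, 0, 0),
    "B.2": (317, 172, 266, 103),
    "B.3": (0, 0, 0, 0),       # the new 20Hz interface goes here
    "B.4": (340, 184, 285, 110),
    "B.5": (363, 196, 304, 118),
    "B.6": (272, 147, 228, 88),
    "B.7": (640, 0, 0, 0),
}

# XN-5 single precision instruction times in microseconds.
XN5_US = {"L": 2.24, "S": 1.86, "A": 2.24, "LALS": 1.47, "M": 5.40, "D": 9.87}

# Time per statement class, in column order, following method (1).
STATEMENT_US = (
    XN5_US["L"] + XN5_US["S"],                                   # assign
    XN5_US["L"] + XN5_US["A"] + XN5_US["LALS"] + XN5_US["S"],    # add/sub
    XN5_US["L"] + XN5_US["M"] + XN5_US["LALS"] + XN5_US["S"],    # multiply
    XN5_US["L"] + XN5_US["LALS"] + XN5_US["D"] + XN5_US["S"],    # divide
)

# Column totals over all blocks (valid because the path has no loops).
totals = tuple(sum(column) for column in zip(*BLOCKS.values()))
path_time_us = sum(n * t for n, t in zip(totals, STATEMENT_US))

print("statement totals:", totals)
print(f"estimated path time: {path_time_us / 1000:.1f} ms")
```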


ALGORITHM  #1 

I had the best data for algorithm #1 (because it was developed
"in-house"). I was able to get a Fortran listing of the model
and some additional data. Because the algorithm has an internal
looping structure, I decided to "flowchart" the algorithm and
then detail the number of high-level statements in each "block".
The statement counts are listed below (a flowchart should be
attached).

Block   Assign   Add/Sub   Mult   Div

A.1       20        0        5      3
A.2       15       17        1      5
A.3     see numbers for EKF "subroutine"
A.4       79       54      110      8
A.5        7        0        0      0
E.1        7        3        4      2
E.2       19       37       39      0
E.3       14        0        0      0
E.4       28       56       14      0
E.5       27       17       25      2

Total    216      184      198     20   ==>   618 "high-level"
         432      736      792     80   ==>  2040 assembly language

Note:  The  algorithm  requires  (at  least)  282  Data  Words 
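
The totals can be cross-checked, and the share of the EKF "subroutine" (the E blocks) separated out, with a short script. This is a sketch: the A.1 multiply count is read as 5, the value that makes the Mult column sum to the stated 198, and the helper name is hypothetical.

```python
# (Assign, Add/Sub, Mult, Div) per block; A.3 is the call to the EKF blocks.
ROWS = {
    "A.1": (20, 0, 5, 3),
    "A.2": (15, 17, 1, 5),
    "A.4": (79, 54, 110, 8),
    "A.5": (7, 0, 0, 0),
    "E.1": (7, 3, 4, 2),
    "E.2": (19, 37, 39, 0),
    "E.3": (14, 0, 0, 0),
    "E.4": (28, 56, 14, 0),
    "E.5": (27, 17, 25, 2),
}
ASM_PER_STATEMENT = (2, 4, 4, 4)   # expansion under method (1)


def column_totals(prefix=""):
    """Column sums over all blocks whose name starts with prefix."""
    rows = [counts for name, counts in ROWS.items() if name.startswith(prefix)]
    return tuple(sum(column) for column in zip(*rows))


totals = column_totals()
asm = tuple(n * f for n, f in zip(totals, ASM_PER_STATEMENT))
ekf_statements = sum(column_totals("E."))

print("high-level:", totals, "=", sum(totals))
print("assembly:  ", asm, "=", sum(asm))
print("EKF share: ", ekf_statements, "of", sum(totals), "statements")
```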

The  flowcharts  for  algorithm  #1  indicate  a  loop  which  is 
bounded  by  NINT.  NINT  is  the  number  of  "integration"  steps 
which  are  used  by  the  Kalman  filter.  The  number  of  integration 
steps  is  determined  by  the  maximum  integration  step  size  and 
the  length  of  time  which  must  be  integrated  over.  For  this 
implementation  NINT  will  be  2. 
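
The relationship described above can be sketched as a ceiling division. The function name and the example interval and step size are hypothetical; the source gives only the resulting value NINT = 2, not the inputs.

```python
import math


def integration_steps(interval_s, max_step_s):
    """Smallest number of equal steps, each no longer than max_step_s,
    that covers interval_s (at least one step)."""
    # The small tolerance guards against floating point noise in the ratio.
    return max(1, math.ceil(interval_s / max_step_s - 1e-9))


# Hypothetical inputs: a 1.0 s interval with a 0.5 s maximum step.
print(integration_steps(1.0, 0.5))
```

Any blocks inside the NINT loop would have their statement counts multiplied by this value when the execution time of the algorithm is estimated.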

ALGORITHM  #2 

Algorithm  #2  has  little  data  (basically  taken  from  past  verbal 
discussions).  The  algorithm  uses  the  same  basic  method  used 
in  past  algorithm  implementations  with  a  little  better  handling 
of  initial  data. 

The algorithm's worst case path is also its steady state path
(at least it is very close to being the worst case path). The
worst case path will be close to 1000 assembly statements long.
Again, we have no mix of the statements so we must assume the
percentage mix outlined above (2).

FINAL  NOTE: 

Algorithm #1 will be implemented in future OFP releases. The
rationale for doing this is performance of the algorithm in
various non-optimum situations. Execution time and algorithm
size were not much of an issue (since both algorithms have
the same estimated order of magnitude of execution time and size).
[Flowchart: Top Level - Algorithm #1]

[Flowchart: A/A Processing Flowchart, from Enter to Return. Block B.3 is
where the 20Hz interface will reside. NOTE: There is no looping structure
which I thought was in TWS.]
performance requirements. It begins early in the software life cycle when requirements and designs
are formulated. By modeling and assessing the predicted performance early, one can choose an

tuned) and the productivity of the development staff is improved (since efforts are not wasted on